The World Happiness Report is a landmark survey of the state of global happiness. The happiness scores and rankings use data from the Gallup World Poll (GWP). The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale.
In addition, the Happiness Report provides values for six factors (GDP per capita, life expectancy, generosity, social support, freedom, and corruption). Each value shows the estimated extent to which that factor contributes to making life evaluations (the happiness score) higher in each country than in Dystopia. The underlying raw data points for these estimates come from other organisations (e.g. WHO) or from Gallup World Poll question results. Dystopia, in this context, is a hypothetical country with values equal to the world’s lowest national averages for each of the six factors. The purpose of establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables. Since life would be very unpleasant in a country with the world’s lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” in contrast to Utopia.
Thus, each of the six factor values expresses how much that factor contributes to a country's happiness score being higher than Dystopia's. The happiness score can therefore be calculated as: \[\text{happiness} = \sum_{i=1}^{6} \text{factorvalue}_i + \text{dystopiahappiness} + \text{residual}\]
This makes it clear that the six factor values are themselves the result of an estimation and therefore cannot be used for analysing variable importance. Regression coefficients fitted on them, for example, would not be informative: once the Dystopia.Residual column is included in the regression, the intercept would be 0 and all coefficients would equal 1.
That is why we looked for an additional version of the happiness dataset that includes the actual raw values, which we can use for analysing variable importance and in the dimension reduction steps.
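The additive identity can be checked directly on the three 2015 rows shown in the table in this section. The sketch below assumes that the `Dystopia.Residual` column already combines the Dystopia baseline and the residual; the shortened column names are ours and only serve the illustration.

```r
# Rows copied from the 2015 "explained by" table in this section.
# Assumption: DystopiaResidual = Dystopia happiness + residual.
explained <- data.frame(
  Country          = c("Switzerland", "Iceland", "Denmark"),
  Happiness        = c(7.587, 7.561, 7.527),
  Economy          = c(1.39651, 1.30232, 1.32548),
  Family           = c(1.34951, 1.40223, 1.36058),
  Health           = c(0.94143, 0.94784, 0.87464),
  Freedom          = c(0.66557, 0.62877, 0.64938),
  Trust            = c(0.41978, 0.14145, 0.48357),
  Generosity       = c(0.29678, 0.43630, 0.34139),
  DystopiaResidual = c(2.51738, 2.70201, 2.49204)
)

# Sum the six "explained by" parts plus the Dystopia residual per country
parts <- c("Economy", "Family", "Health", "Freedom", "Trust",
           "Generosity", "DystopiaResidual")
explained$Reconstructed <- rowSums(explained[, parts])
explained$Difference <- explained$Happiness - explained$Reconstructed

print(explained[, c("Country", "Happiness", "Reconstructed", "Difference")])
```

Because the parts add up to the score (up to rounding), a regression of the score on these columns would only recover this bookkeeping identity and say nothing about variable importance.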
Based on the happiness dataset, we want to answer the following leading questions.
To answer these questions we need the raw values to build our analysis on. We further decided to add factors which might explain the different happiness levels. We were interested in how drug abuse correlates with happiness and found suitable datasets for alcohol consumption and tobacco consumption. Additionally, we were interested in how the modern use of social media influences happiness. However, we only found a fitting internet dataset, which captures the percentage of individuals in a country who use the Internet.
Our questions are:
How does happiness change over time, and can we see certain patterns? To answer this question we need the change in happiness. We can use the plain happiness dataset, as it captures the happiness scores and the “explained by” parts for the six factors over time. We can therefore calculate and visualize the happiness changes over time.
To answer our two main questions we decided, for the reasons given above, to create two datasets.
| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |
To answer the question “What influences happiness?” we had to use the raw data of the factors and not their “explained by” values. In addition, we wanted to include further factors and added the following three datasets:
By merging the datasets we now have four additional factors.
To join the different datasets we had to do some preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN) and joining the datasets on the year and the country code.
After joining, we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to use only one year for analysing the influential factors.
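A join on year and country code can be sketched in base R as follows. The two data frames, their column names and all values are hypothetical toy data, not the actual preprocessing code of this report.

```r
# Hypothetical toy inputs; names and values are illustrative only.
happiness <- data.frame(
  Code      = c("AAA", "BBB", "AAA", "BBB"),
  Year      = c(2017, 2017, 2018, 2018),
  Happiness = c(5.1, 6.2, 5.3, 6.0)
)
alcohol <- data.frame(
  Code    = c("AAA", "BBB"),
  Year    = c(2018, 2018),
  Alcohol = c(7.2, 9.6)
)

# Left join on both keys: every happiness row is kept, and rows without
# a match in the alcohol data get NA in the Alcohol column.
joined <- merge(happiness, alcohol, by = c("Code", "Year"), all.x = TRUE)
print(joined)
```

Using `all.x = TRUE` keeps country-years without a match visible as `NA`, which makes the gaps in the additional datasets easy to count afterwards.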
missing values full data
We inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). We then excluded all rows that still contained missing values. Figure “missing values 2017” shows, for example, that the smoking and alcohol datasets did not contain any values for 2017. We also renamed the columns to get shorter labels.
The final influential factors dataset consists of 96 rows (countries for the year 2018) and 18 columns. A more detailed explanation of the variables can be found in the Statistical Appendix of the World Happiness Report.
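The year selection described above can be sketched like this. The data frame `full` is a hypothetical stand-in for the joined dataset; only the logic (count missing values per year, keep the best year, drop incomplete rows) mirrors the text.

```r
# Hypothetical joined data with gaps in one of the added factors.
full <- data.frame(
  Code      = rep(c("AAA", "BBB", "CCC"), times = 3),
  Year      = rep(2016:2018, each = 3),
  Happiness = c(5.1, 6.2, 4.4, 5.0, 6.1, 4.6, 5.3, 6.0, 4.5),
  Tobacco   = c(NA, NA, NA, NA, 20.1, NA, 25.0, 19.8, NA)
)

# Count missing cells per year over all non-key columns
value_cols <- setdiff(names(full), c("Code", "Year"))
missing_per_year <- tapply(rowSums(is.na(full[, value_cols])), full$Year, sum)
print(missing_per_year)

# Keep the year with the fewest missing values, then drop incomplete rows
best_year <- as.integer(names(which.min(missing_per_year)))
one_year  <- full[full$Year == best_year, ]
complete  <- one_year[complete.cases(one_year), ]
print(complete)
```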
| Country | Region | Year | Happiness | Economy | Social | Health | Freedom | Generosity | Corruption | Positive | Negative | Government | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | Central and Eastern Europe | 2018 | 5.004403 | 9.412399 | 0.6835917 | 68.7 | 0.8242123 | 0.0053850 | 0.8991294 | 0.7132996 | 0.3189967 | 0.4353380 | ALB | 7.17 | 2882735 | 29.2 | 65.40000 |
| Argentina | Latin America and Caribbean | 2018 | 5.792797 | 9.809972 | 0.8999116 | 68.8 | 0.8458947 | -0.2069366 | 0.8552552 | 0.8203097 | 0.3205021 | 0.2613523 | ARG | 9.65 | 44361150 | 21.8 | 77.70000 |
| Armenia | Commonwealth of Independent States | 2018 | 5.062449 | 9.119424 | 0.8144490 | 66.9 | 0.8076437 | -0.1491087 | 0.6768264 | 0.5814877 | 0.4548403 | 0.6708276 | ARM | 5.55 | 2951741 | 26.7 | 68.24505 |
missing values 2017
missing values 2018
One of the objectives of preliminary data analysis is to get a feel for the data you are dealing with by describing its key features and summarizing the results. We focus on the second dataset, the influential factors dataset, which includes the raw values and not the “explained by” values.
First we check, via the summary, how the explanatory variables are distributed. As we can see, they are on different scales, especially “Health”, “Population” and “Internet”. As we do not want the dimension reduction analysis to be driven by the variables with the largest distances, we scale them by \(\frac{x - mean(x)}{sd(x)}\).
## Happiness Economy Social Health Freedom Corruption Generosity Positive Negative Government Alcohol
## Min. :3.335 Min. : 6.630 Min. :0.5035 Min. :48.20 Min. :0.5286 Min. :0.1506 Min. :-0.33638 Min. :0.4347 Min. :0.1580 Min. :0.07971 Min. : 0.019
## 1st Qu.:4.702 1st Qu.: 8.570 1st Qu.:0.7396 1st Qu.:59.85 1st Qu.:0.7245 1st Qu.:0.6849 1st Qu.:-0.14312 1st Qu.:0.6427 1st Qu.:0.2132 1st Qu.:0.33120 1st Qu.: 4.280
## Median :5.536 Median : 9.669 Median :0.8581 Median :66.80 Median :0.8084 Median :0.7989 Median :-0.02550 Median :0.7353 Median :0.2749 Median :0.50385 Median : 7.410
## Mean :5.597 Mean : 9.394 Mean :0.8220 Mean :65.23 Mean :0.7945 Mean :0.7255 Mean :-0.01767 Mean :0.7114 Mean :0.2845 Mean :0.50944 Mean : 7.221
## 3rd Qu.:6.340 3rd Qu.:10.346 3rd Qu.:0.9130 3rd Qu.:71.20 3rd Qu.:0.8784 3rd Qu.:0.8559 3rd Qu.: 0.07377 3rd Qu.:0.8000 3rd Qu.:0.3509 3rd Qu.:0.64084 3rd Qu.:10.570
## Max. :7.858 Max. :11.454 Max. :0.9660 Max. :75.00 Max. :0.9699 Max. :0.9520 Max. : 0.49938 Max. :0.8836 Max. :0.5438 Max. :0.98812 Max. :15.090
## Population Tobacco Internet
## Min. :6.042e+05 Min. : 4.60 Min. : 8.00
## 1st Qu.:6.028e+06 1st Qu.:13.90 1st Qu.:30.80
## Median :1.585e+07 Median :22.80 Median :68.25
## Mean :5.380e+07 Mean :22.21 Mean :59.34
## 3rd Qu.:5.042e+07 3rd Qu.:27.95 3rd Qu.:81.62
## Max. :1.353e+09 Max. :45.50 Max. :97.32
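The scaling formula corresponds to what base R's `scale()` does by default. The sketch below uses the Health, Population and Internet values of the three rows shown in the influential factors table above; three rows are of course far too few for a real analysis and only illustrate the transformation.

```r
# Values taken from the three example rows (Albania, Argentina, Armenia).
raw <- data.frame(
  Health     = c(68.7, 68.8, 66.9),
  Population = c(2882735, 44361150, 2951741),
  Internet   = c(65.4, 77.7, 68.24505)
)

# scale() centers each column to mean 0 and divides by its standard deviation
scaled <- as.data.frame(scale(raw))

# The same transformation written out for one column
manual_health <- (raw$Health - mean(raw$Health)) / sd(raw$Health)

print(round(colMeans(scaled), 10))  # column means after scaling
print(apply(scaled, 2, sd))         # column standard deviations after scaling
```

After scaling, a variable such as Population, which is measured in millions, no longer dominates distance-based methods merely because of its unit.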
We can see that every factor is now on the same scale. We have some outliers for Corruption, Generosity and Population.
In the correlation matrix plot we see that happiness has the strongest correlations with Economy (0.801), Internet (0.786), Social (0.768) and Health (0.767). Among the correlations between the explanatory variables, the following stand out:
In this chapter we try to answer the question “What influences happiness?” by several methods of influential factors analysis.
One tool for getting a first look at what influences happiness is linear regression. For the regression we use the unscaled data. If our linear model predicts well, we can interpret the coefficients in terms of how they influence the outcome. This is also called regression analysis, where the goal is to isolate the relationship between each explanatory variable and the outcome variable.
However, this interpretation assumes that the value of one explanatory variable can be changed while the others stay fixed. This is of course only true if there are no correlations between the explanatory variables. If this independence does not hold, we have a problem of multicollinearity. It can make the coefficients swing wildly depending on which other explanatory variables are in the model. The coefficients then become very sensitive to small changes in the model and cannot be easily interpreted.
One way to assess how strongly the explanatory variables are affected by multicollinearity is the variance inflation factor (VIF). VIFs quantify how strongly each explanatory variable is correlated with the others. VIFs between 1 and 5 suggest a small correlation; VIFs greater than 5 represent critical levels of multicollinearity, where the coefficients are poorly estimated.
If we build a linear regression model on all explanatory variables, we get an R-squared of 0.8063. However, by plotting the VIF values we can see that a model based on all explanatory variables has severe multicollinearity. Therefore, we cannot interpret the coefficients for Internet, Health and Economy.
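The VIF of a predictor can be computed by regressing it on all other predictors and taking \(1 / (1 - R^2)\). The following sketch does this in base R on the built-in `mtcars` data, which is only a stand-in here. The helper `vif_manual` is our own illustrative function; the report does not show which function produced its VIF plot.

```r
# Illustrative helper (not from the report): VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on all other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

predictors <- c("disp", "hp", "wt", "qsec")
vifs <- vif_manual(mtcars, predictors)
print(round(vifs, 2))

# Cross-check: the VIFs equal the diagonal of the inverse correlation matrix
vifs_check <- diag(solve(cor(mtcars[, predictors])))
print(round(vifs_check, 2))
```

A predictor that is almost a linear combination of the others has an auxiliary \(R^2\) close to 1 and therefore a very large VIF.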
##
## Call:
## lm(formula = Happiness ~ ., data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60190 -0.24719 0.00124 0.28565 1.79684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.527e+00 1.487e+00 -1.027 0.3076
## Economy 3.317e-01 1.628e-01 2.037 0.0449 *
## Social 3.251e+00 9.952e-01 3.266 0.0016 **
## Health 7.641e-03 1.971e-02 0.388 0.6993
## Freedom 1.404e+00 8.833e-01 1.589 0.1159
## Corruption -1.247e+00 4.577e-01 -2.724 0.0079 **
## Generosity 7.633e-01 4.282e-01 1.783 0.0784 .
## Positive 6.045e-01 7.901e-01 0.765 0.4465
## Negative 2.332e+00 9.192e-01 2.537 0.0131 *
## Government -9.855e-01 4.520e-01 -2.180 0.0321 *
## Alcohol -3.825e-03 1.898e-02 -0.202 0.8407
## Population -3.861e-10 4.241e-10 -0.910 0.3654
## Tobacco -1.165e-02 7.295e-03 -1.596 0.1143
## Internet 6.001e-03 7.110e-03 0.844 0.4012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.541 on 81 degrees of freedom
## Multiple R-squared: 0.8063, Adjusted R-squared: 0.7752
## F-statistic: 25.94 on 13 and 81 DF, p-value: < 2.2e-16
If we build a linear regression model without Internet and Economy, we get an R-squared of 0.7745. This R-squared is lower than before, but after plotting the VIF values we can see that we may now interpret the coefficients of the remaining explanatory variables, as all VIF values are below 5.
Interestingly, only Social, Health, Corruption, Negative and Government are statistically significant:
##
## Call:
## lm(formula = Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63299 -0.30363 -0.02198 0.34810 2.08143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.609e+00 1.357e+00 -1.186 0.238858
## Social 4.646e+00 9.788e-01 4.746 8.54e-06 ***
## Health 5.214e-02 1.626e-02 3.207 0.001908 **
## Freedom 8.769e-01 9.265e-01 0.946 0.346660
## Corruption -1.616e+00 4.667e-01 -3.463 0.000847 ***
## Generosity 4.041e-01 4.430e-01 0.912 0.364406
## Positive 8.449e-01 8.287e-01 1.020 0.310893
## Negative 1.927e+00 9.682e-01 1.990 0.049879 *
## Government -1.016e+00 4.768e-01 -2.132 0.035974 *
## Alcohol 4.612e-03 2.003e-02 0.230 0.818514
## Population -1.135e-10 4.242e-10 -0.268 0.789649
## Tobacco -8.494e-03 7.700e-03 -1.103 0.273137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5766 on 83 degrees of freedom
## Multiple R-squared: 0.7745, Adjusted R-squared: 0.7446
## F-statistic: 25.92 on 11 and 83 DF, p-value: < 2.2e-16
Next we tried a linear regression method with shrinkage. With lasso regression, some estimates can become exactly zero. The result is therefore a type of variable selection, which makes the model sparse and easier to interpret. For lasso regression, all predictor variables should be scaled so that they have the same standard deviation. Otherwise, the predictor variables are weighted unequally in the penalty term. The glmnet() function, however, standardizes the predictors by default, and the output coefficients are transformed back to the original scale.
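To illustrate why lasso estimates can become exactly zero, here is a minimal coordinate-descent sketch in base R. It is not the glmnet implementation and not the code behind the results of this report. The functions `soft_threshold` and `lasso_cd`, the `mtcars` stand-in data and the chosen penalty values are all assumptions made for the illustration. It minimizes \(\frac{1}{2n}\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1\) on standardized predictors.

```r
# Soft-thresholding: shrinks z towards 0 and returns exactly 0 if |z| <= lambda
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Illustrative lasso via cyclic coordinate descent (hypothetical helper)
lasso_cd <- function(X, y, lambda, max_sweeps = 10000, tol = 1e-10) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  col_ss <- colSums(X^2) / n
  for (sweep in seq_len(max_sweeps)) {
    beta_old <- beta
    for (j in seq_len(ncol(X))) {
      # Residual with predictor j left out of the current fit
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * partial_resid) / n
      beta[j] <- soft_threshold(rho, lambda) / col_ss[j]
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  setNames(beta, colnames(X))
}

# Stand-in data: standardized predictors and a centered response
X <- scale(as.matrix(mtcars[, c("wt", "hp", "qsec", "drat")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Smallest penalty (up to rounding) at which all coefficients are zero
lambda_max <- max(abs(crossprod(X, y))) / nrow(X)

for (lambda in c(0, 0.5, 1.001 * lambda_max)) {
  cat("lambda =", round(lambda, 3), "\n")
  print(round(lasso_cd(X, y, lambda), 4))
}
```

With `lambda = 0` the procedure reduces to ordinary least squares on the standardized data. As the penalty grows, more coefficients are set exactly to zero by the soft-thresholding step, which is the variable-selection effect described above.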
## [1] "Lasso Regression"
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.30404144
## Social 3.22444813
## Health 0.04668353
## Freedom .
## Corruption -0.56261151
## Generosity .
## Positive 0.00795680
## Negative .
## Government .
## Alcohol .
## Population .
## Tobacco .
The results of the lasso regression confirm our results from the ordinary regression for Social, Health and Corruption. However, Positive is added, and Negative and Government are removed from the model.
The change in the significant factors between the linear regression and the lasso regression can be seen below:
The lasso and the linear regression model agree on the significant factors Social, Health and Corruption. It seems plausible that a high value in Social (answers to the question: “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”) comes with high values of Happiness. The same holds for Health, even though its coefficient is not as large. Health captures the “Healthy life expectancy at birth from WHO” and is therefore a foundation for Happiness. High values of Corruption, on the other hand, come with a decrease in Happiness. This is also plausible, as an unfair environment, whether in business or in government, results in an overall decrease in happiness.
Principal component analysis is based on the assumption that, in the case of strongly correlated variables, there is a third variable which cannot be measured directly and which stands behind these correlated variables and, as it were, manifests itself in them. This means that the measurable variables are just another manifestation of variables that are in the background and cannot be measured directly. These background variables are called principal components. The aim of principal component analysis is to identify such background variables or factors from the measured data and to explain the observed relationships as completely as possible.
For the PCA we are using the scaled factors without the happiness score. The first two PCs explain 59.01 % of the variation together.
PC1 explains 39.07 % of the variation and the coefficients are the following:
\[PC1=-0.415 \cdot Economy-0.397 \cdot Social-0.395 \cdot Health-0.174 \cdot Freedom+0.192 \cdot Corruption \\ +0.115 \cdot Generosity-0.182 \cdot Positive+0.317 \cdot Negative+0.132 \cdot Government-0.289 \cdot Alcohol \\ +0.069 \cdot Population-0.164 \cdot Tobacco-0.411 \cdot Internet\]
PC2 explains 19.94% of the variation and the coefficients are the following:
\[PC2=0.059 \cdot Economy+0.014 \cdot Social+0.054 \cdot Health-0.478 \cdot Freedom+0.388 \cdot Corruption \\ -0.396 \cdot Generosity-0.384 \cdot Positive+0.108 \cdot Negative-0.467 \cdot Government+0.054 \cdot Alcohol \\ -0.078 \cdot Population+0.246 \cdot Tobacco+0.103 \cdot Internet\]
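The explained variance and the coefficients (loadings) of the principal components can be obtained with base R's `prcomp()`. The sketch below uses the built-in `mtcars` data as a stand-in for the scaled factors, so the numbers differ from the ones reported here.

```r
# Stand-in data, centered and scaled as in the scaling step
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# PCA on the already standardized data
pca <- prcomp(X)

# Share of the total variance explained by each principal component
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * explained, 2))

# Loadings: the coefficients of the linear combinations for PC1 and PC2
print(round(pca$rotation[, 1:2], 3))

# Each PC score is the linear combination of the scaled variables
scores_manual <- X %*% pca$rotation
print(head(round(scores_manual[, 1:2], 3), 3))
```

The sign of a principal component is arbitrary, so a PC may come out mirrored between runs or software versions. Whether high happiness appears at low or high values of PC1 is therefore a matter of orientation, not of substance.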
The PCA plot, colored by the rounded happiness scores, clusters the countries quite well. The added ellipses give a quick visual idea of where each happiness level is located. For low values on PC1 and PC2 we find the really high happiness scores. The top 3 countries for 2018 (Finland, Denmark and Switzerland) are all in that region. It is also interesting that most of the countries in the lower left are from ‘Western Europe’, except for ‘New Zealand’, ‘Australia’ and ‘Canada’, which are from ‘North America and ANZ’. We follow up on this finding in chapter “What further influences happiness?”.
When we move from left to right, the happiness scores decrease. The values 8, 7, 6, 5 and 4 are separated quite well, so PC1 captures the overall happiness. However, the separation between happiness scores is not exact. Scores 8 and 7 are well separated, as the countries with score 8 lie close together. Scores 7 and 6 overlap by about 50%, but a separation is still recognizable. Score 6 has two outliers, UZB (Uzbekistan) and BEN (Benin), which is why its ellipse covers such a large area. Score 5 has no large outliers, but the overlap between 5 and 6 is even bigger than between 6 and 7. Scores 5 and 4 again overlap by about 50%, so a separation is still visible. Score 3, however, consists of only 3 points and is not well separated; it overlaps with 4, 5 and 6. PC1 can therefore separate happier countries better than unhappier ones.
After plotting the two outliers, UZB (Uzbekistan) and BEN (Benin), within their happiness group 6, we can still see that they are outliers in many categories compared to the other countries in happiness class 6.
Next we looked at the biplot, where we can see how the original variables correlate with the principal components.
For PC1 we see strong correlations with Internet, Economy, Health and Social. Alcohol is also correlated with PC1, but not as strongly. Negative is correlated with PC1 as well, which fits the expectation that high values in Negative come with lower happiness.
For PC2 the strongest correlations are with Positive, Freedom, Generosity, Government and Corruption. PC2 therefore splits quite well between (Positive, Freedom) and (Corruption). We can also see this visually, as the ‘Western Europe’ countries are on the side of (Positive, Freedom).
Next, we built the self-organizing map (SOM). The results of the SOM mapping match our results from the PCA perfectly. We see the same distribution of the happiness categories (8-3). Via the codes plot we can further confirm our results, as the happiest countries (lower left) have the highest values in Internet, Economy, Social, Health and Freedom, and the lowest values for Corruption. For the unhappiest countries it is the other way round.
SOMs
SOM
Let us now pick up our finding from the PCA regarding the grouping of the countries from ‘Western Europe’ and ‘North America and ANZ’ in the lower left (happy region). Below we plotted the happiness of the countries by region. We can see a clear difference between the happiness distributions of the regions. ‘Western Europe’ and ‘North America and ANZ’ are the happiest regions; ‘Sub-Saharan Africa’ is the region with the unhappiest countries, but has quite a wide spread.
Geographic map: each country is colored based on its percentage change in happiness over time (2015-2022).